feat: support lp_norm operator on metax by LindseyMei · Pull Request #1359 · InfiniTensor/InfiniCore

LindseyMei · 2026-07-01T07:55:38Z

Add MetaX backend for lp_norm.

Changes

Create src/infiniop/ops/lp_norm/metax/lp_norm_metax.h and src/infiniop/ops/lp_norm/metax/lp_norm_metax.maca, reusing ../cuda/kernel.cuh.
Wire MetaX dispatch into operator.cc (CREATE / GET workspace size / CALCULATE / DESTROY).
Make lp_norm/cuda/kernel.cuh compile on MACA by excluding MetaX from the CUDART_VERSION >= 12090 ::cuda::maximum() branch; MetaX uses the cub::Max() / lambda fallback.

Validation

python3 test/infiniop/lp_norm.py --metax passes on MetaX C500 for F16/BF16/F32, including contiguous and strided axis=-1 cases.
clang-format --dry-run --Werror clean on all 4 modified/new files.
Build: xmake -y -j4 + xmake install -y succeeded.

Notes

This PR is based on the current main branch.
The CUDART guard change is a minimal portability fix; the scalar cub::Max() path is already used by other MetaX operators (e.g. reduce/cuda/reduce.cuh).

Add MetaX backend for lp_norm: - Create metax/lp_norm_metax.h and metax/lp_norm_metax.maca - Wire MetaX dispatch into operator.cc (CREATE/GET/CALCULATE/DELETE) - Make lp_norm/cuda/kernel.cuh compile on MACA by excluding MetaX from the CUDART_VERSION >= 12090 ::cuda::maximum() branch Validation: python3 test/infiniop/lp_norm.py --metax passes on MetaX C500 (F16/BF16/F32, contiguous and strided axis=-1 cases). Signed-off-by: LindseyMei <648816901@qq.com>

LindseyMei · 2026-07-01T08:07:50Z

MetaX lp_norm Performance Report

Environment

GPU: MetaX C500 64GB
MACA: 3.3.0.15
Repository: InfiniTensor/InfiniCore
Branch: feat/metax-lp-norm
Commit: 27f2ef05
Benchmark script: /data/workspace/lp_norm_bench.py
Method: explicit torch.cuda.synchronize(), 10 warmup + 300 iterations

What changed

Added a MetaX backend for the lp_norm operator. The implementation reuses the existing lp_norm/cuda/kernel.cuh with a minimal portability fix for MACA (exclude MetaX from the CUDART_VERSION >= 12090 ::cuda::maximum() branch so it falls back to cub::Max() / lambda reduction).

Correctness

python3 test/infiniop/lp_norm.py --metax passes on MetaX C500 for F16/BF16/F32, covering:

contiguous tensors with axis = 0 / 1 / 2 / 3
strided tensors with axis = -1
inplace / out-of-place

Performance (libinfiniop)

shape	axis	p	dtype	lib_ms	Gelem/s	GB/s
(4096, 4096)	-1	2	F16	0.587	28.57	114.3
(4096, 4096)	-1	2	F32	0.594	28.23	225.9
(4096, 4096)	-1	2	BF16	0.600	27.96	111.8
(4096, 4096)	-1	1	F16	0.596	28.13	112.5
(4096, 4096)	-1	1	F32	0.595	28.21	225.7
(4096, 4096)	-1	1	BF16	0.601	27.93	111.7
(8192, 8192)	-1	2	F16	1.620	41.43	165.7
(8192, 8192)	-1	2	F32	1.653	40.59	324.7
(8192, 8192)	-1	2	BF16	1.695	39.60	158.4
(8192, 8192)	-1	1	F16	1.638	40.97	163.9
(8192, 8192)	-1	1	F32	1.665	40.32	322.5
(8192, 8192)	-1	1	BF16	1.694	39.61	158.4
(16384, 16384)	-1	2	F16	5.570	48.19	192.8
(16384, 16384)	-1	2	F32	5.714	46.98	375.8
(16384, 16384)	-1	2	BF16	5.847	45.91	183.6
(16384, 16384)	-1	1	F16	5.608	47.87	191.5
(16384, 16384)	-1	1	F32	5.694	47.14	377.1
(16384, 16384)	-1	1	BF16	5.789	46.36	185.5
(12, 16, 512, 512)	3	2	F16	1.181	42.62	170.5
(12, 16, 512, 512)	3	2	F32	1.224	41.13	329.0
(12, 16, 512, 512)	3	2	BF16	1.236	40.71	162.9
(12, 16, 512, 512)	3	1	F16	1.178	42.74	171.0
(12, 16, 512, 512)	3	1	F32	1.235	40.76	326.1
(12, 16, 512, 512)	3	1	BF16	1.233	40.82	163.3

Observations

lp_norm is a reduction-style normalization (max reduction + p-norm reduction + elementwise scale), so its effective bandwidth is lower than pure elementwise kernels, as expected.
F32 reaches ~375 GB/s on the largest 2D tensor, indicating the kernel is memory-bandwidth-bound for large reductions.
F16/BF16 bandwidth is roughly half of F32 because the kernel accumulates in float and does not yet use half-precision vector loads; this is consistent with the current CUDA implementation.
Performance is stable across p=1 and p=2.

Known limitations

The shared kernel accumulates intermediate results in float regardless of input dtype, which limits F16/BF16 bandwidth.
No warp-level vectorization is applied yet; future work could explore half2 / float4 loads for the elementwise scale loop.

LindseyMei requested a review from a team July 1, 2026 07:55

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: support lp_norm operator on metax#1359

feat: support lp_norm operator on metax#1359
LindseyMei wants to merge 1 commit into
InfiniTensor:mainfrom
LindseyMei:feat/metax-lp-norm

LindseyMei commented Jul 1, 2026

Uh oh!

LindseyMei commented Jul 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

LindseyMei commented Jul 1, 2026

Changes

Validation

Notes

Uh oh!

LindseyMei commented Jul 1, 2026

MetaX lp_norm Performance Report

Environment

What changed

Correctness

Performance (libinfiniop)

Observations

Known limitations

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant